Abstract 

Side-channel attacks are efficient attacks against cryptographic devices.
They use only quantities observable from outside, such as the duration and
the power consumption.

Attacks against synchronous devices using electric observations are 
facilitated by the fact that all transitions occur simultaneously with some 
global clock signal. 

Asynchronous control removes this synchronization and therefore makes
it more difficult for the attacker to isolate interesting intervals. In
addition, the coding of data in an asynchronous circuit is inherently more
difficult to attack.

This article describes the Programmable Logic Block of an asynchronous 
FPGA resistant against side-channel attacks. Additionally it can imple- 
ment different styles of asynchronous control and of data representation. 



1 Introduction 

Side-channel attacks (SCA) were put forward mainly by Paul Kocher
et al. in 1996 [?]. This first description of an SCA explained how the mere
observation of the duration of computations could allow an attacker to retrieve
the secret key. The attack was then improved and extended to other
cryptosystems [?].

In 1999 Kocher et al. described what they called "DPA" (differential power
analysis) [?]. This new attack used the power consumption instead of the
duration but yielded the same result: the retrieval of the secret key. The
process of this latter attack is relatively simple: a large number of
cryptographic operations are monitored and the cipher text is stored together
with the electric consumption. Then guesses are made about some parts of the
secret key, and these guesses are confirmed or not by statistical processing of
the data. Other attacks against various cryptosystems were based on this
method [?].

Countermeasures soon appeared to protect systems based on a strong algebraic
structure [?, 23, 47, ?]. In contrast, the protection of symmetric
cryptosystems often consisted in introducing some randomization, either in the
computing process or in the power consumption, to prevent the statistical
processing of the acquired data. However, "counter-countermeasures" also
appeared [19]. Some other protection schemes were designed [?].

An interesting and apparently efficient countermeasure is WDDL [?],
which duplicates each signal in the circuit so that, whatever the value is, one
of the lines will toggle. This countermeasure was enhanced by an improved
routing of related signals [?], which reduces the differences between the power
consumptions of a '1' and a '0'.

Asynchronous circuits, the history of which dates back to 1950, are
nowadays increasingly considered as a viable alternative to classical synchronous
designs. Indeed they feature some very useful properties such as flexibility,
robustness, high speed and low power. This article brings another good reason to
consider asynchronous designs: a greater resistance against side-channel attacks.

Some industrial applications of asynchronous ASICs and FPGAs are beginning to
appear both in the academic world [?] and in industry [?].



At the same time synchronous circuits are suffering from problems arising 
from the distribution of the clock signal through the IC and the excessive power 
consumption (and thus dissipation!). 

As an asynchronous circuit has no centralized clock, the problems associated
with the clock distribution, clock skew and power consumption do not exist. In
addition, these circuits offer advantages such as:

• average-time performance,

• lower electromagnetic radiation,

• better robustness towards variations of the supply voltage,

• better robustness towards fabrication process variations [?],

• better composability and modularity because of the simple handshake
interfaces and the local timing [?] and

• better scrambling of the side-channel information [?].

Asynchronous circuits thus seem to be a viable alternative which would remove
these limiting factors [35].



Due to these advantages, there has been a resurgence of interest in
asynchronous design, especially in the reprogrammable field. There have been
several recent successful design projects such as ASPRO-216 [?], an AES
crypto-processor [?], many Philips designs targeting low power [?], projects
focused on deriving an asynchronous FPGA from a synchronous one, like MONTAGE
and PGA-STC [?], and projects targeting asynchronous application-specific
FPGAs, either locally synchronous, like GALSA and STACC [?, 33], or completely
asynchronous, like PAPA [?], and other recent works [?].
PGA-STC was developed to implement two-phase bundled-data systems such as
micropipelines, GALSA for massively parallel computing architectures, STACC
for reconfigurable computation, and PAPA was mainly created and optimized
for pipelined processes.

This article describes the design of the PLB of a new asynchronous FPGA
with security as the main requirement, even at the expense of performance.
Indeed, in the particular case of cryptography, performance is second to security
even if it cannot be ignored. The FPGA must be able to implement various
styles of asynchronous protocols and different representations of data so as to
enable comparisons between these representations and protocols with respect to
their ability to thwart side-channel attacks.

Section 2 describes the representation of data and the different asynchronous
protocols used in the FPGA. We also discuss their suitability for trusted
computing. Section 3 shows the construction of the PLB to implement the 4-phase
protocol using both binary and ternary representations of data. Section 4 shows
the necessary additions to the PLB to accommodate the 2-phase protocols.
Section 5 shows how the FPGA is programmed. Finally, the last section concludes
the article.

2 Asynchronous Representation of Signals 

As opposed to synchronous data, whose validity is guaranteed by the timing 
of some global "clock" signal, the asynchronous computations are synchronized 
by the availability of data and, when necessary, by a Request/Acknowledge
handshake signalling.

A formal description of delay-insensitive representation of data can be found
in [?]. In the Quasi-Delay Insensitive (QDI) protocols the request is carried
by the data itself. This yields a reliable design, independent of the
routing.

The data are transmitted together with the availability information and thus
a logic signal or, for short, a "signal", must be represented by more than a
single electrical signal or, for short, a "wire"*. In this article, a wire is
able to take one of two values, which we denote 0 and 1 regardless of their
actual electric implementation.

In order to avoid glitches, a sufficient condition is that given a signal S 
represented by n wires, the transmission of a new value of S must consist in 
exactly one of the n wires changing its electrical state. This means that the 
number of wires is greater than or equal to the number of the states of S. As
silicon area and routing are precious resources, the number of wires
representing a given signal will thus be equal to the number of possible values
of this signal.

The most frequently used kind of signal is the binary signal, which carries
a {'1','0'} information. Such a signal is encoded with 2 wires. This
representation is called "Dual-Rail" or "1-out-of-2". However, ternary signals,
which carry a {'0','1','2'} information, can also be considered. Such a signal
is represented by 3 wires and one speaks of a "1-out-of-3" representation. This
representation is more compact than the 1-out-of-2 one for arithmetic: 6 wires
in 1-out-of-2 represent 3 signals, which can take 8 valid values, compared
to two 1-out-of-3 signals, which can take 9 valid values. However, due to
the greater complexity of gates in the 1-out-of-3 representation, binary
signals are preferred most of the time.
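The wire-count comparison above is simple arithmetic and can be checked with a
short sketch. The function name `valid_words` is an illustrative choice, not
something defined in this article; it only counts the code words available when
a fixed number of wires is split into 1-out-of-n signals.

```python
def valid_words(wires_per_signal: int, total_wires: int) -> int:
    """Count the valid words carried by `total_wires` wires that are
    grouped into 1-out-of-n signals of `wires_per_signal` wires each."""
    if total_wires % wires_per_signal:
        raise ValueError("the wires must split evenly into signals")
    signals = total_wires // wires_per_signal
    # Each 1-out-of-n signal carries exactly n valid values.
    return wires_per_signal ** signals


print(valid_words(2, 6))  # three 1-out-of-2 signals
print(valid_words(3, 6))  # two 1-out-of-3 signals
```

With 6 wires the two calls give 8 and 9 respectively, matching the figures
quoted in the text.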

An asynchronous design may need additional signals, which are dedicated
to synchronisation. These signals carry no data information and can thus be
coded on a single wire. They will be referred to as Acknowledge signals. The
inputs of the gates which receive such a signal will be denoted $S_{in}$ and
those driving these signals will be called $S_{out}$.

*If one could work with non-standard electrical levels, a {−5 V, 0 V, +5 V}
representation on a single wire per signal would be acceptable in some cases
but we shall restrict ourselves in the following pages to standard CMOS levels:
Vdd and Vss.



2.1 Asynchronous Protocols 

There are two main families of QDI asynchronous communication protocols, 
which differ by the nature of the signalling information: the 2-phase protocols 
and the 4-phase protocols. 



2.1.1 4-Phase Protocol 



Under a 4-phase protocol, valid values of a signal are separated by a special
value, denoted $\Omega$. The transmission of a value x from an emitter to a
receiver proceeds as follows:





Step  Emitter               Receiver
1     sends x →
2                           ← acknowledges x
3     sends $\Omega$ →
4                           ← acknowledges $\Omega$



For instance, if a signal S is represented by n wires
$(S_0, S_1, \ldots, S_{n-1})$, the $\Omega$ value will be implemented as the
n-tuple $(0, 0, \ldots, 0)$ while the value i will be represented by
$(0, \ldots, 0, S_i = 1, 0, \ldots, 0)$.

This particular kind of 4-phase protocol is named "WCHB" in [33, Sec. 2.3.1]
and "DPL" among the secure computing community [?].
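The 4-phase encoding can be sketched in a few lines. This is only an
illustration under stated assumptions: the names `OMEGA`, `encode`,
`four_phase_trace` and `toggles` are hypothetical, and the spacer is modelled
as Python's `None`. The sketch builds the sequence of wire states seen on a
link and checks that consecutive states differ by exactly one wire.

```python
OMEGA = None  # hypothetical marker for the spacer value


def encode(value, n):
    """1-out-of-n code word: the spacer is all zeros, value i sets wire i."""
    if value is OMEGA:
        return tuple(0 for _ in range(n))
    if not 0 <= value < n:
        raise ValueError("value out of range for this signal")
    return tuple(1 if k == value else 0 for k in range(n))


def four_phase_trace(values, n):
    """Wire states on the link: spacer, v0, spacer, v1, spacer, ..."""
    trace = [encode(OMEGA, n)]
    for v in values:
        trace.append(encode(v, n))
        trace.append(encode(OMEGA, n))
    return trace


def toggles(a, b):
    """Number of wires that change between two wire states."""
    return sum(x != y for x, y in zip(a, b))


trace = four_phase_trace([1, 0, 1], n=2)
# Every step of the protocol changes exactly one wire.
print(all(toggles(a, b) == 1 for a, b in zip(trace, trace[1:])))
```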



2.1.2 2-Phase Protocols 

Under a 2-phase protocol, no special value is used to separate valid ones. The
transmission of a value x from an emitter E to a receiver R proceeds as follows:



Step  Emitter               Receiver
1     sends x →
2                           ← acknowledges x



In this article we will describe the implementations of two 2-phase protocols: 

2-phase-edge protocol:

a signal S, which can take n values, is represented by n wires and the
arrival of a new value i is signalled by wire i toggling 0 → 1 or 1 → 0.
Note that the instantaneous values of the wires are not significant under
this protocol: only the toggles are significant.

2-phase-ledr protocol:

a signal S is represented by two wires: $S_d$ and $S_r$. The arrival of a new
value x is signalled by one of $S_d$ and $S_r$ toggling 0 → 1 or 1 → 0 and the
value is given by $S_d$.

Note that the requirement that any change of the value of the signal
be implemented by the toggling of exactly 1 wire limits the 2-phase-ledr
protocol to binary signals.

WCHB: Weak Condition Half Buffer.
DPL: Dual-Rail Precharge Logic.
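The two 2-phase encodings can be illustrated with a minimal sketch. The
function names below (`edge_step`, `ledr_step`, `ledr_trace`, `phase`) are
hypothetical and the all-zero starting state is an assumption taken from the
initialization used later in the article. Each step toggles exactly one wire,
so the parity of the wires alternates with every transmitted value.

```python
def edge_step(state, value):
    """2-phase-edge: sending `value` toggles wire `value` only."""
    return tuple(w ^ 1 if k == value else w for k, w in enumerate(state))


def ledr_step(state, bit):
    """2-phase-ledr: state is (d, r); toggle d when the value changes,
    otherwise toggle r to signal a repeated value."""
    d, r = state
    if bit != d:
        return (bit, r)
    return (d, r ^ 1)


def ledr_trace(bits, start=(0, 0)):
    """Successive (d, r) states while sending `bits` from `start`."""
    states = [start]
    for b in bits:
        states.append(ledr_step(states[-1], b))
    return states


def phase(state):
    """Parity of the Hamming weight of the wires."""
    return sum(state) % 2


states = ledr_trace([1, 1, 0, 0, 1])
print(states)
print([phase(s) for s in states])
```

In the LEDR trace the first wire always equals the last value sent, while the
phase alternates between 0 and 1 at every step.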

Remark 1 The 4-phase protocol can be considered as a 2-phase protocol in
which every "valid" value is followed by a dummy $\Omega$ value and in which the
gates return to the $\Omega$ value as soon as all inputs have received the
$\Omega$ value. The 2-phase protocols are thus inherently twice as fast as the
4-phase ones. This is especially important in an FPGA, in which the routing
delays are often the limiting factor of the speed of the system. However, even
if twice as fast, they lead to much more complex gates than the 4-phase ones.

2.2 Initialization of the System 

At the initial time of the system's operation, all gates must be reset to a known, 
deterministic value. (This is also true for synchronous systems even if some 
flip-flops sometimes need no initialization.) 

The requirement of a known, deterministic value does not impose any specific
value on the wires. However, the simplest initialization, which we shall use in
this article, consists in initializing all gates so that all wires are set to 0.

The consequence of this initialization is that the Hamming weight of any
signal is 0 just after reset, which implies that its parity is even.

The relevant property just after RESET is thus that:

• under a 4-phase protocol, an $\Omega$ value is output by all gates and

• under a 2-phase protocol, the parity of the Hamming weight of the outputs
of any gate is 0.

2.3 Request Signalling 

The Request event is coded into the data of the QDI protocol itself; a request
corresponds to a change of one of the wires encoding the signal. A gate will be
ready to perform its computation when each of its inputs has received a request
and when all gates using its output have acknowledged the last value sent.

If performance were the major requirement this would not be true: for
instance, an AND gate could perfectly well output a '0' as soon as one of its
inputs has received a '0'. But such an early evaluation would occur only when
some input(s) receive a '0' and never when all receive '1'. This difference in
timing could potentially leak some information about the computations being
performed to a malevolent observer. Thus such "early evaluation" will never be
allowed in a secured circuit and computations will always be performed upon the
rendez-vous of all data and Acknowledge inputs.

As the arrival of a new value is always signalled by a single wire changing
value, the parity of the Hamming weight of any signal changes each time a new
value is transmitted.
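The timing argument against early evaluation can be made concrete with a toy
model. This is only a sketch: the arrival times are arbitrary illustrative
numbers, and `rendezvous_time`, `early_and_time` and `completion_times` are
hypothetical names. An input is modelled as a pair (arrival time, value).

```python
from itertools import product


def rendezvous_time(arrivals):
    """A secured gate fires only when its last input has arrived."""
    return max(t for t, _ in arrivals)


def early_and_time(arrivals):
    """Hypothetical early-evaluating AND gate: fires at the first '0'
    input, or at the last input when every input is '1'."""
    zeros = [t for t, v in arrivals if v == 0]
    return min(zeros) if zeros else max(t for t, _ in arrivals)


ARRIVAL_TIMES = (1.0, 3.0)  # arbitrary delays of the two inputs


def completion_times(gate_time):
    """Firing time of the gate for every pair of input values."""
    return {
        values: gate_time(list(zip(ARRIVAL_TIMES, values)))
        for values in product((0, 1), repeat=2)
    }


print(completion_times(rendezvous_time))
print(completion_times(early_and_time))
```

In this model the rendez-vous gate fires at the same instant for all input
values, whereas the firing time of the early-evaluating gate depends on the
data, which is the leak described above.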

Under a two-phase protocol, a gate will be ready to compute its output when 
all its inputs show a parity opposed to the current output parity. 

Under a 4-phase protocol a gate is ready to compute as soon as each input
has left the $\Omega$ state. As $\Omega$ is coded as $(0, 0, \ldots, 0)$ it has
an even parity, while any valid value, signalled by a single wire at 1, has an
odd parity. The behaviour of the gates under a 4-phase protocol is thus
consistent with that of the gates under 2-phase protocols. This will be useful
for the design of the FPGA.
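The parity-based readiness rule can be written down as a small sketch. The
names `parity` and `inputs_ready` are hypothetical, and the model deliberately
ignores the Acknowledge input, which is discussed in the next section; it only
compares the parity of the data inputs with the parity of the current output.

```python
def parity(wires):
    """Parity of the Hamming weight of the wires of one signal."""
    return sum(wires) % 2


def inputs_ready(inputs, output):
    """True when every input signal shows a parity opposed to the
    parity of the current output (Acknowledge inputs not modelled)."""
    return all(parity(sig) != parity(output) for sig in inputs)


# 4-phase: output at the spacer, both dual-rail inputs valid.
print(inputs_ready([(1, 0), (0, 1)], (0, 0)))
# 4-phase: one input is still at the spacer, the gate must wait.
print(inputs_ready([(1, 0), (0, 0)], (0, 0)))
# 2-phase LEDR: inputs in odd phase, output still in even phase.
print(inputs_ready([(1, 0), (0, 1)], (1, 1)))
```

The same test covers both families of protocols, which is the consistency
property used for the design of the PLB.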

2.4 Acknowledge Signalling 

The Acknowledge signal consists of a single wire, carrying a {$\Omega$, ack}
information under a 4-phase protocol or an {odd, even} "phase" information
under a 2-phase protocol.

Given the "parity" property of the signals, the Acknowledge signal is
computed as the XOR of all wires carrying the output signal. An OR gate would
be enough under a 4-phase protocol. However, the OR and the XOR functions are
identical on the allowed domain of values of the wires under a 4-phase
protocol, since at most one wire is at 1 at any time.
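The equality of OR and XOR on the allowed 4-phase code words can be checked
exhaustively. The sketch below uses hypothetical helper names
(`allowed_states`, `or_equals_xor_on`) and enumerates the all-zero word plus
the n one-hot words of an n-wire signal.

```python
from functools import reduce
from itertools import product
from operator import or_, xor


def allowed_states(n):
    """The all-zero spacer plus the n one-hot code words."""
    spacer = tuple(0 for _ in range(n))
    one_hot = [tuple(int(k == i) for k in range(n)) for i in range(n)]
    return [spacer] + one_hot


def or_equals_xor_on(states):
    """True when OR and XOR of the wires agree on every given state."""
    return all(reduce(or_, s) == reduce(xor, s) for s in states)


for n in (2, 3, 4):
    print(n, or_equals_xor_on(allowed_states(n)))

# On the full domain, which includes forbidden words such as (1, 1),
# the two functions differ.
print(or_equals_xor_on(list(product((0, 1), repeat=2))))
```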

This signal is sent by a given gate to those which drive its inputs. When
the output of a gate S is sent to more than one gate, $D_1, D_2, \ldots$, a
rendez-vous is computed to combine the synchronization signals coming from the
$D_i$ into a single signal, fed to S.

2.5 C-Element 

The C-element is the gate which implements the rendez-vous of signals. It has
an arbitrary number p of input wires, denoted $I_1, I_2, \ldots, I_p$, and a
single output Z, whose equation is:

$$Z = \begin{cases}
1 & \text{if } \forall i \in [1,p],\ I_i = 1, \\
0 & \text{if } \forall i \in [1,p],\ I_i = 0, \\
Z & \text{otherwise.}
\end{cases} \qquad (1)$$

Eq. (2) shows an equivalent form of Eq. (1):

$$Z = (Z \wedge (I_1 \vee \cdots \vee I_p)) \vee (I_1 \wedge \cdots \wedge I_p), \qquad (2)$$

where $\wedge$ and $\vee$ are respectively the AND and OR operators.

Fig. 1 depicts the implementation of a C-element derived from Eq. (2) using
a multiplexer (MUX), which we use in our FPGA.

In an FPGA the C-element can be implemented in many ways. A p-input
C-element can be implemented in a (p+1)-input LUT, provided the output of the
LUT can be fed back to one of the inputs.

Figure 1: C-element implemented with a MUX.



If Z = 0, the MUX selects the AND gate, which will output 1 if and only if
$\forall i \in [1,p],\ I_i = 1$. When this condition becomes true, the output
of the MUX becomes 1 and the output of the OR is selected to be sent to Z
instead of that of the AND. As $\forall i,\ I_i = 1 \Rightarrow \exists i : I_i \neq 0$,
the output is stable at 1. The output remains 1 until all inputs are back to 0.
Mutatis mutandis, the same proof shows that the output of the gate comes back
to 0 when all inputs are 0 and that this value is stable until all inputs are 1
again. Thus the gate correctly implements the rendez-vous with no glitch.
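The argument above can be cross-checked by brute force. The following sketch
models three views of the C-element as pure next-state functions: the
rendez-vous behaviour (set when all inputs are 1, reset when all are 0, hold
otherwise), the closed form of Eq. (2), and the MUX structure of Fig. 1. The
function names are hypothetical and gate delays are not modelled.

```python
from itertools import product


def c_element_spec(z, inputs):
    """Rendez-vous behaviour: 1 if all inputs are 1, 0 if all are 0,
    otherwise keep the previous output z."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return z


def c_element_eq2(z, inputs):
    """Closed form: Z = (Z AND (OR of inputs)) OR (AND of inputs)."""
    return (z & int(any(inputs))) | int(all(inputs))


def c_element_mux(z, inputs):
    """MUX form: the current output selects the OR gate when it is 1
    and the AND gate when it is 0."""
    return int(any(inputs)) if z else int(all(inputs))


def forms_agree(p):
    """Compare the three forms on every state of a p-input C-element."""
    return all(
        c_element_spec(z, i) == c_element_eq2(z, i) == c_element_mux(z, i)
        for z in (0, 1)
        for i in product((0, 1), repeat=p)
    )


print([forms_agree(p) for p in (1, 2, 3, 4)])
```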

2.6 Asynchronous Computation & Security
2.6.1 Timing Attack 

As each gate always waits for every input to be ready before computing its
result, the duration of the computations is independent of the data. However a
dependency can be generated if the lengths of the wires $x_i$ which implement a
signal x are different, thus generating different propagation times for each
value of x.

Thus the following necessary condition must hold: for any pair of gates
(S, R), connected by a signal x composed of wires $(x_0, x_1, \ldots, x_p)$:

• under the 4-phase protocol, the propagation time from S to R of the
transition from $\Omega$ to any valid value and of the transition from any
valid value to $\Omega$ must be independent of the value;

• under a 2-phase protocol, the rising and falling times of any output wire
must be equal and independent of the former and next values of the
signal.

As the condition must be fulfilled by any signal routed through the FPGA,
this implies that:

• in any routing channel, all wires must have the same length and the same
capacitance with respect to Vdd or Vss,

• for any pair of wires in two routing channels connected by a switchbox, the
propagation time through the switchbox must be the same for all possible
pairs,

• for any input of a PLB, the propagation time from the network to the
processing elements must be uniform,

• for any output, the propagation time to the routing network must be
uniform.

If all these conditions are satisfied and if all PLBs process information at
the same speed, the timing attack [?, ?, 13, 23] is impossible.



2.6.2 Measurement of Power Consumption 

Under the 4-phase protocol, two valid values are separated in time by an
$\Omega$ value, implemented as all wires at 0. The transition from $\Omega$ to
a valid value i consists in a rising edge 0 → 1 of wire i and the return to
$\Omega$ is the opposite falling transition.

In order to thwart power-analysis attacks, the power consumption must be the
same for the rising edge of any of the wires $x_i$ which compose a signal x and
also for their falling edges. This condition implies that the lengths of the
$x_i$ through the routing network be the same.

The necessary conditions to thwart the timing attack are also necessary here
but, in addition, the resistances of the output transistors must be equal.



3 4-Phase Protocol 

This protocol is the simplest of all three because the instantaneous values of
the wires composing any signal are sufficient to determine the value of this
signal. We will implement the gates with:

• from 1 to 6 inputs, including the $S_{in}$ signals, and

• from 2 to 4 outputs, not including the $S_{out}$ signals.



3.1 Encoding of Signals 

Though it is not the only possible one, we shall use the encoding of Eq. (3)
for a signal x in the rest of this article:

$$\begin{array}{ll}
(x_0, x_1) = (0, 0): & x = \Omega, \\
(x_0, x_1) = (1, 0): & x = \text{'0'}, \\
(x_0, x_1) = (0, 1): & x = \text{'1'}, \\
(x_0, x_1) = (1, 1): & \text{forbidden state.}
\end{array} \qquad (3)$$

The occurrence of the "(1,1)" forbidden state will always signal either a
malfunction or an attack against the system. Fig. 2 depicts the succession
of values on a signal x, represented by 2 wires $(x_1, x_0)$, and, when present,
the associated transmissions of the ACK signal by the receiver back to the
transmitter.

Figure 2: 4-Phase Protocol.



3.2 1-out-of-2, 2-input Gates

Let $f(x, y) : \mathbb{F}_2 \times \mathbb{F}_2 \to \mathbb{F}_2$ be a
two-variable Boolean function. Its output is a 1-out-of-2 signal represented by
two wires $O_1$ and $O_0$. We denote respectively $f^1(x, y)$ and $f^0(x, y)$
the functions computing the values of each wire.

Fig. 3 depicts the minimal structure of a PLB necessary to implement in
the most general way a gate with 2 binary inputs. Three signals enter the gate:
2 data signals x and y, respectively implemented by the $(x_0, x_1)$ and
$(y_0, y_1)$ pairs of wires, and $S_{in}$, the synchronization signal.

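The output value (O) is implemented by two 6 → 1 LUTs, respectively
computing the $O_0$ and $O_1$ wires. Eq. (4) shows the equations of the outputs:

$$O_1 = \begin{cases}
f^1(x, y) & \text{if } (x \neq \Omega) \wedge (y \neq \Omega) \wedge (S_{in} = 0), \\
0 & \text{if } (x = \Omega) \wedge (y = \Omega) \wedge (S_{in} = 1), \\
O_1 & \text{otherwise,}
\end{cases}$$

$$O_0 = \begin{cases}
f^0(x, y) & \text{if } (x \neq \Omega) \wedge (y \neq \Omega) \wedge (S_{in} = 0), \\
0 & \text{if } (x = \Omega) \wedge (y = \Omega) \wedge (S_{in} = 1), \\
O_0 & \text{otherwise,}
\end{cases} \qquad (4)$$

$$S_{out} = O_0 \oplus O_1.$$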

The "memory effect" implied by Eq. (4) is implemented by sending each of
$O_0$ and $O_1$ to an input of the LUT which drives it. Thus the minimal
practical size for the LUT is 64 bits, which can implement any 6-bit → 1-bit
function. As there are two output bits, the minimal size of the PLB is 2 LUTs.
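To illustrate how such a 64-bit LUT could be filled, the sketch below generates
the truth table of one output wire of a dual-rail, 4-phase, 2-input gate. It is
only an illustration under stated assumptions: the address bit order
(x0, x1, y0, y1, S_in, feedback), the names `lut_table` and `address`, and the
choice to hold the output when an input shows the forbidden (1,1) word are all
hypothetical and not taken from this article.

```python
from itertools import product

VALID = {(1, 0): 0, (0, 1): 1}  # dual-rail code words -> Boolean value
OMEGA = (0, 0)                  # spacer word


def lut_table(f, rail):
    """64-entry table of output wire O_rail for the Boolean function f.
    Assumed address order: x0, x1, y0, y1, S_in, feedback (MSB first)."""
    table = []
    for x0, x1, y0, y1, s_in, fb in product((0, 1), repeat=6):
        x, y = (x0, x1), (y0, y1)
        if x in VALID and y in VALID and s_in == 0:
            # Wire O_rail is raised when f evaluates to `rail`.
            out = int(f(VALID[x], VALID[y]) == rail)
        elif x == OMEGA and y == OMEGA and s_in == 1:
            out = 0      # return to the spacer
        else:
            out = fb     # memory effect: hold the fed-back output
        table.append(out)
    return table


def address(x0, x1, y0, y1, s_in, fb):
    """Index into the table for one combination of LUT inputs."""
    addr = 0
    for bit in (x0, x1, y0, y1, s_in, fb):
        addr = 2 * addr + bit
    return addr


and_o1 = lut_table(lambda a, b: a & b, rail=1)
and_o0 = lut_table(lambda a, b: a & b, rail=0)
print(len(and_o1), and_o1[address(0, 1, 0, 1, 0, 0)])
```

For an AND gate, the entry addressed by x = '1', y = '1', S_in = 0 is 1 in the
table of O_1 and 0 in the table of O_0, and every entry outside the two firing
conditions simply copies the feedback bit.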

Even if an OR gate would be enough, the $S_{out}$ signal is computed by a XOR
gate (see Section 2.4).

As the inputs to the LUTs are the same, with the exception of the feedback
wires, there can be a single connection box to the routing network, which
divides by 2 the total size of the connection boxes. Fig. 3 shows the minimal
structure of the PLB, which allows 2-input gates with synchronization to be
implemented.

Figure 3: Minimal PLB for 4-phase, 2-input gates.



Remark 2 In Eq. (4), each wire of x and y is loaded with exactly the same
number of inputs, as is necessary to achieve the indiscernibility of signals for
a malevolent observer.

3.3 1-out-of-2, 3-input Gates

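Eq. (4) can be immediately modified into Eq. (5) to add a third input term z and
the new equation shows that we need a 7-input LUT with one feedback.

$$O_1 = \begin{cases}
f^1(x, y, z) & \text{if } (x \neq \Omega) \wedge (y \neq \Omega) \wedge (z \neq \Omega) \wedge (S_{in} = 0), \\
0 & \text{if } (x = \Omega) \wedge (y = \Omega) \wedge (z = \Omega) \wedge (S_{in} = 1), \\
O_1 & \text{otherwise,}
\end{cases}$$

$$O_0 = \begin{cases}
f^0(x, y, z) & \text{if } (x \neq \Omega) \wedge (y \neq \Omega) \wedge (z \neq \Omega) \wedge (S_{in} = 0), \\
0 & \text{if } (x = \Omega) \wedge (y = \Omega) \wedge (z = \Omega) \wedge (S_{in} = 1), \\
O_0 & \text{otherwise,}
\end{cases} \qquad (5)$$

$$S_{out} = O_0 \oplus O_1.$$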

As the 3-input gates need 6 inputs for a 3-variable function, they cannot be
implemented in the structure of Fig. 3, in which each 6 → 1 LUT has 5 inputs
from the routing network and 1 feedback input.

As it is not realistic to use two 7 → 1 LUTs because of the number of
programming points (2 × 128 bits), we separate the rendez-vous + computation
function from the memory function and introduce a specific component: the
memory point.

Fig. 4 depicts the memory point, which consists of a pair of C-elements,
together with a XOR gate, which computes the $S_{out}$ signal. Two MUX, under
control of a single programming point, allow the C-elements to be bypassed. This
will be useful when implementing the 2-phase protocols.

Fig. 5 depicts the schematic of the 3-input 1-out-of-2 gate. The ancillary
"return to $\Omega$" function is implemented by a specialized 6-input OR gate
while the 6 → 1 LUTs are programmed to compute the rendez-vous and the functions
$f^1(x, y, z)$ and $f^0(x, y, z)$.

Figure 5: Binary 3-input gate with 4-phase protocol.

Figure 6: Structure of PLB needed to implement a ternary 2-input gate.



Remark 3 Note that this is a much better use of the LUT than the one implied by
Fig. 3, in which all bits corresponding to the feedback input set to 1 are filled
with '1' to implement the inclusive OR of all 4 input bits.

The 4-LUT PLB can implement two independent 3-input, 1-out-of-2 functions.
Ex: a full-adder.

Remark 4 The wiring depicted by Fig. 5 can handle any gate the inputs of
which sum up to 6 wires (Ex: one $S_{in}$ + one 1-out-of-2 input + one
1-out-of-3 input; two $S_{in}$, two 1-out-of-2 inputs, etc.).

Remark 5 The feed-back and the associated MUX at the inputs of the LUTs could
be removed. However they will be useful later for the implementation of the
2-phase-ledr protocol.

3.4 1-out-of-3, 2-input Gates

Just as the 1-out-of-2, 3-input gates, the 1-out-of-3, 2-input gates need 6
inputs but they need three outputs, each of them equipped with a memory point.
Strictly speaking, a 1-out-of-3 gate needs three LUTs, each of them implementing
one of the functions $O_i = f^i(x, y),\ i = 0, 1, 2$.

However, as most of the gates in a design will still be binary, the PLB
features four 6 → 1 LUTs. One of them will remain unused and filled with 0 when
implementing a 1-out-of-3 gate. The computation of the $S_{out}$ signal needs
some specialized hardware. Fig. 6 depicts the new PLB needed for a 2-input
1-out-of-3 gate, with the supplementary devices shown in a grey rectangle:

• a MUX, controlled by a programming point, which allows the PLB to be used
either as two separate 2-binary-input, binary-output gates or as a single
combined gate and

• a single XOR gate which computes the XOR of all four outputs of the
memory points.

For the same reason of compatibility with the binary gates, the inputs to
the pairs of LUTs are split into two groups. The load on each of the 12 input
wires is exactly the same, thus equalizing the power consumptions of all possible
transitions on inputs.

Remark 6 The OR gates which compute the "return to $\Omega$" signal are not
grouped but will output the same value since their inputs are the same.

3.5 Conclusion on the 4-Phase Protocol

In order to implement 2- and 3-input gates under the 4-phase protocol, the
PLB must at least consist of four 6 → 1 LUTs, named $L_0$, $L_1$, $L_2$ and
$L_3$. One input of each LUT can be replaced with a feedback signal equal to the
output pin.

The schematic depicted in Fig. 5 is general: it can implement any gate
with:

• inputs consisting of any combination of 6 wires or fewer, including the
$S_{in}$ signals, and

• outputs consisting of any combination of 4 wires, not counting the $S_{out}$
signals: 2 binary outputs with separate $S_{out}$ signals, 1 ternary output
with a single acknowledge-out signal, or 1 quaternary output with an $S_{out}$
signal.

4 2-Phase Protocols 

4.1 Phase of a Signal 

Under the 2-phase protocols, valid values of a signal are not separated by
"$\Omega$" markers. However, as the arrival of a new value (possibly identical
to the preceding one) is indicated by the toggling of exactly one wire, the
parity of the Hamming weight of the wires which represent a signal toggles at
each new data.

In the following pages, the phase of the signal X, denoted $\phi(X)$, is by
definition the parity of the Hamming weight of the wires representing X.

Remark 7 For Acknowledge signals, which consist of a single wire, the phase
is equal to the value of the wire itself. The name of an Acknowledge signal A
will thus be used instead of $\phi(A)$.

Figure 7: Transmission of a signal and the acknowledge under the 2-phase-ledr
protocol.



At the beginning of the computation, all wires are set to a known value. 
2-phase protocols require that, after initialization and before any computation 
is started, the parities of all signals be the same, say even. A simple way of 
ensuring this even parity is to initialize all wires to 0. 

As the phase of a signal toggles with every new valid value, a given gate is
ready to compute its output when the phases of all "data" signals at its inputs
are the same and different from the current phase of the output, and when the
phase of the $S_{in}$ signal, if present, is the same as the output phase.

After the gate has performed its computation, the phase of its outputs
becomes the common one of the data inputs and thus the $S_{out}$ signal toggles.

4.2 2-Phase, LEDR Protocol 

This protocol is referred to as "level-encoded dual-rail", or LEDR [?].

4.2.1 Transmission of a Signal

Fig. 7 shows the transmission protocol of the successive values of a signal,
together with the acknowledge signal. One can see that:

• a signal X is represented by two wires: the "data" wire ($X_d$) and the
"repeat" wire ($X_r$);

• each time a value is sent, exactly one wire toggles;

• the value of the signal X is the value of the $X_d$ wire, thus the arrival
of a new value, different from the preceding one, is signalled by the toggling
of $X_d$;

• the arrival of a new value, identical to the preceding one, is signalled
by $X_r$ toggling; thus the instantaneous value of $X_r$ is irrelevant, only its
toggles are significant.

Remark 8 The 2-phase-ledr protocol is restricted to binary signals. Otherwise,
the transition between two values would imply that more than a single wire toggle.

Figure 8: 2-input Gate under the 2-phase-ledr protocol.



4.2.2 Binary 2-input Gates 

Let $f(x, y) : \mathbb{F}_2 \times \mathbb{F}_2 \to \mathbb{F}_2$ be a
two-variable Boolean function. The inputs are represented by 4 wires: $x_d$,
$x_r$, $y_d$ and $y_r$, to which a synchronization signal, $S_{in}$, may be
added, and the output signal O is represented by two wires, $O_d$ and $O_r$,
together with an acknowledge output $S_{out}$.
The equations of the output wires are:



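$$O_d = \begin{cases}
f(x_d, y_d) & \text{if } (\phi(x) = 0) \wedge (\phi(y) = 0) \wedge (S_{in} = 1), \\
f(x_d, y_d) & \text{if } (\phi(x) = 1) \wedge (\phi(y) = 1) \wedge (S_{in} = 0), \\
O_d & \text{otherwise,}
\end{cases}$$

$$O_r = \begin{cases}
f(x_d, y_d) & \text{if } (\phi(x) = 0) \wedge (\phi(y) = 0) \wedge (S_{in} = 1), \\
\overline{f(x_d, y_d)} & \text{if } (\phi(x) = 1) \wedge (\phi(y) = 1) \wedge (S_{in} = 0), \\
O_r & \text{otherwise,}
\end{cases} \qquad (6)$$

$$S_{out} = O_d \oplus O_r.$$

The complement in the second case of $O_r$ makes the output phase
$O_d \oplus O_r$ equal to the common phase of the inputs.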

Eq. (6) shows that each of ($O_d$, $O_r$) is a function of 6 variables:

• two input data signals, represented by 4 wires,

• one $S_{in}$ signal, represented by a single wire and

• one feed-back signal, also 1 wire.

These functions can be implemented in the same hardware as the
corresponding gate under the 4-phase protocol. Fig. 8 shows the assignment of
the wires.

The hardware elements which are not used to implement this gate are
represented in dashed lines:

• the 6th input to the 6 → 1 LUT, which is replaced by the feed-back,

• the 6-input OR gate,

• the memory element, which is programmed as "transparent" using its
internal programming point (see Fig. 4).

Note that, contrary to the case of the 4-phase protocol, here the $S_{out}$
value must be computed by a XOR gate.
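The behaviour of a 2-input gate under the 2-phase-ledr protocol can be
sketched as a next-state function. This is a behavioural model under stated
assumptions, not the LUT contents: the names `ledr_gate_step` and `s_out` are
hypothetical, signals are (d, r) pairs, and the repeat wire of the output is
assumed to be chosen so that the output phase equals the common phase of the
inputs, as required by the phase rule of Section 4.1.

```python
def phase(d, r):
    """Phase of a LEDR signal: parity of its two wires."""
    return d ^ r


def ledr_gate_step(f, x, y, s_in, out):
    """Next (O_d, O_r) pair of a 2-input LEDR gate computing f.
    x, y and out are (d, r) pairs; s_in is the acknowledge input."""
    px, py = phase(*x), phase(*y)
    # Fire when both inputs share a phase and S_in shows the other one.
    if px == py and s_in != px:
        value = f(x[0], y[0])
        return (value, value ^ px)  # output phase becomes the input phase
    return out                      # otherwise hold the previous output


def s_out(out):
    """Acknowledge output: XOR of the two output wires."""
    return out[0] ^ out[1]


def and_gate(a, b):
    return a & b


out = ledr_gate_step(and_gate, (1, 0), (1, 0), 0, (0, 0))
print(out, s_out(out))
```

Starting from the all-zero reset state, two inputs carrying '1' in odd phase
make the AND gate output '1' in odd phase and toggle its acknowledge output.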

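4.2.3 3-input Gates

Let $f(x, y, z) : \mathbb{F}_2^3 \to \mathbb{F}_2$. Eq. (7) shows the
expressions of the output wires.

$$O_d = \begin{cases}
f(x_d, y_d, z_d) & \text{if } (\phi(x) = 1) \wedge (\phi(y) = 1) \wedge (\phi(z) = 1) \wedge (S_{in} = 0), \\
f(x_d, y_d, z_d) & \text{if } (\phi(x) = 0) \wedge (\phi(y) = 0) \wedge (\phi(z) = 0) \wedge (S_{in} = 1), \\
O_d & \text{otherwise,}
\end{cases}$$

$$O_r = \begin{cases}
\overline{f(x_d, y_d, z_d)} & \text{if } (\phi(x) = 1) \wedge (\phi(y) = 1) \wedge (\phi(z) = 1) \wedge (S_{in} = 0), \\
f(x_d, y_d, z_d) & \text{if } (\phi(x) = 0) \wedge (\phi(y) = 0) \wedge (\phi(z) = 0) \wedge (S_{in} = 1), \\
O_r & \text{otherwise,}
\end{cases} \qquad (7)$$

$$S_{out} = O_d \oplus O_r.$$

Eq. (7) shows that each of $O_d$ and $O_r$ is a function of 7 input variables
and thus cannot be implemented in a 6 → 1 LUT.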



4.2.4 Practical Implementation 

Under the 4-phase protocol the outputs were set back to 0 by the rendez-vous of
the 0 coming from the LUT and the 0 coming from the 6-input OR gate. Under
the 2-phase protocol an OR gate cannot express the "return to 0" condition.
Therefore the wiring of Fig. 5 is modified according to Fig. 9.

Two MUX, controlled by a programming point, are added, which allow the
6-input OR gate to be replaced by the two other 6 → 1 LUTs of the PLB. This way,
each of $O_d$ and $O_r$ is now a rendez-vous of the outputs of 2 LUTs:

$O_d$ = rendez-vous($L_0$, $L_2$)
$O_r$ = rendez-vous($L_1$, $L_3$)

Figure 9: Implementation of the 3-input gate under the 2-phase protocol.



Eq. [5] shows the programming of lut Lq and L2 and Eq. [5] shows the 
programming of LUT Li and L3. 



Ln = 



f{xd,yd,zd) 
f{xd,yd,zd) 


f{xd, Vd, Zd) 

f{xd,yd,Zd) 
1 



if = 0) A {cPiy) 

if {cf,{x) = 1) A {cf>{y) 

otherwise, 
if (^(a;) = 0) A (</)(y) 
if {cl>{x) = 1) A (<^(y) 

otherwise. 



0) A ((/.(z) = 0) A (5,„ = 1), 

1) A (0(z) = 1) A (5„ = 0), 

0) A ((/.(z) = 0) A (5„ = 1), 

1) A ((/.(z) = 1) A (5,„ = 0), 



(8) 

When the conditions for a transition are fulfilled, $L_0$ and $L_2$ have the same
value. Thus the rendez-vous occurs and $O_d$ takes its new value. Otherwise
$L_0 = 0$ and $L_2 = 1$: the C-element within the memory element has different
values on its inputs and $O_d$ is locked.

Figure 10: Global Structure of a 2-input 2-Phase-Edge gate.



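$$
\begin{aligned}
L_1 &= \begin{cases}
f(x_d, y_d, z_d) & \text{if } (\phi(x) = 0) \wedge (\phi(y) = 0) \wedge (\phi(z) = 0) \wedge (S_{in} = 1), \\
\overline{f(x_d, y_d, z_d)} & \text{if } (\phi(x) = 1) \wedge (\phi(y) = 1) \wedge (\phi(z) = 1) \wedge (S_{in} = 0), \\
0 & \text{otherwise,}
\end{cases} \\
L_3 &= \begin{cases}
f(x_d, y_d, z_d) & \text{if } (\phi(x) = 0) \wedge (\phi(y) = 0) \wedge (\phi(z) = 0) \wedge (S_{in} = 1), \\
\overline{f(x_d, y_d, z_d)} & \text{if } (\phi(x) = 1) \wedge (\phi(y) = 1) \wedge (\phi(z) = 1) \wedge (S_{in} = 0), \\
1 & \text{otherwise.}
\end{cases}
\end{aligned}
\tag{9}
$$

Mutatis mutandis, the same demonstration shows the validity of $O_r$.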
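The LUT programming of Eq. (8) and Eq. (9) can be illustrated by a small behavioural model. The following is only a sketch under stated assumptions: the function names (`c_element`, `phase`, `ledr_gate_step`) are hypothetical, the phase of a signal is taken as the XOR of its two wires, and the value loaded on the second output wire is assumed to be `f` XOR the new output phase, so that $S_{out} = O_d \oplus O_r$ follows the phase of the inputs.

```python
def c_element(a, b, prev):
    """Rendez-vous (Muller C-element): copy the inputs when they agree, else hold."""
    return a if a == b else prev

def phase(d, r):
    """Assumed LEDR phase of a signal carried by wires d and r."""
    return d ^ r

def ledr_gate_step(f, x, y, z, s_in, o_d, o_r):
    """One evaluation of the 3-input gate; x, y, z are (d, r) wire pairs."""
    phases = {phase(*x), phase(*y), phase(*z)}
    fire0 = phases == {0} and s_in == 1   # all inputs in phase 0, S_in = 1
    fire1 = phases == {1} and s_in == 0   # all inputs in phase 1, S_in = 0
    fire = fire0 or fire1
    val = f(x[0], y[0], z[0])
    # L0 and L2 agree only when the gate may fire; otherwise 0 and 1 lock O_d.
    l0 = val if fire else 0
    l2 = val if fire else 1
    # Assumption of this sketch: O_r = f XOR (new output phase).
    r_val = val ^ 1 if fire1 else val
    l1 = r_val if fire else 0
    l3 = r_val if fire else 1
    return c_element(l0, l2, o_d), c_element(l1, l3, o_r)

# Hypothetical example function: 3-input majority.
maj = lambda a, b, c: (a & b) | (a & c) | (b & c)
```

With inputs of mixed phases, or with $S_{in}$ in the wrong state, each pair of LUT outputs disagrees and both C-elements hold their previous values.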



4.2.5 Conclusion on the 2-Phase, LEDR Protocol 

Apart from the shaded area in Fig. [9], the 2-phase-LEDR protocol needs the same
resources as the 4-phase protocol.

As for security, all inputs to the gates have an equal load, but the value of a
signal X is the value of its wire $x_d$. This is a potential security risk, which will
have to be investigated as soon as the ICs have been delivered.



4.3 2-Phase, Edge Protocol 

Signals under the 2-phase-edge protocol can take an arbitrary number of values. 
Binary signals are represented by 2 wires, ternary signals are represented by 3 
wires, etc... However the complexity of the gates is quadratic in the number of 
wires per signal. Thus the use of this protocol is in practice limited to binary 
signals. The complexity of the gates is also quadratic in the number of inputs. 
Again this limits in practice the number of inputs to 2. In the sequel signals are 
binary and a signal X is thus represented by 2 wires: $(x_0, x_1)$.

The coding of the signals relies exclusively on the toggling of wires; the
instantaneous values of the wires are always irrelevant. This means that the current
state of 4 wires has to be stored. Thus even for a 2-input gate all four LUT of
the PLB will have to be used.

Figure 11: 2x2-DW.



4.3.1 Structure of a Gate 

The global structure of a 2-input gate under the 2-phase-edge protocol is
depicted by Fig. [10]. The operation of the gate is divided in three steps:

Detection: waits for an edge on $A_i$ and one on $B_j$ and toggles the
corresponding $C_{i,j}$,

Computation: toggles $I_{f(i,j)}$ and

Synchronization: toggles $O_{f(i,j)}$ and $S_{out}$ if and only if $S_{in}$ has toggled since
the last data output.



Detection: 2x2-decision wait The detection and the decoding of the input
data is performed by the circuitry known as the "2x2-decision wait" or,
shorter, the "2x2-DW". The circuitry, shown on Fig. [11], works as follows:

1. assume an initial state such that, for each C-element, the inputs are equal
(as this is the initial state, with all wires set to 0, the recurrence can start),

2. after an input value $i \in \{0, 1\}$ has arrived on input port A and an input
value $j \in \{0, 1\}$ on input port B, $A_i$ and $B_j$ have toggled (double-thickness
continuous lines),

3. at this point:

- one input to $C_{i,1-j}$ has toggled ⇒ $C_{i,1-j}$ is unchanged,

- one input to $C_{1-i,j}$ has toggled ⇒ $C_{1-i,j}$ is unchanged,

- both inputs to $C_{i,j}$ have toggled ⇒ $C_{i,j}$ toggles,

4. the new value of $C_{i,j}$ is sent to the next stage and to the appropriate
XOR gates to cancel the unwanted toggling of $C_{i,1-j}$ and $C_{1-i,j}$ (double-
thickness dashed lines),

5. all four C-elements now have their inputs identical, which was the initial
situation, and $C_{i,j}$ has toggled, indicating to the next stage that:






• both input ports A and B have received a new data, 

• the data just arrived on A was i and 

• the data just arrived on B was j. 

Each of the $C_{i,j}$ can be expressed as:

$$
C_{i,j} = \text{rendez-vous}(A_i \oplus C_{i,1-j},\; B_j \oplus C_{1-i,j}) \tag{10}
$$

Each of the $C_{i,j}$ is a 5-term expression depending on:

• three feedback lines: $C_{i,j}$ (itself), $C_{i,1-j}$ and $C_{1-i,j}$ and

• two input lines: $A_i$ and $B_j$.
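The five steps above and Eq. (10) can be checked with a short behavioural simulation. This is a sketch only: the helper names (`c_element`, `settle`, `send`) are hypothetical, gate delays are abstracted into synchronous update rounds, and the data on port A is assumed to arrive before the data on port B.

```python
def c_element(a, b, prev):
    """Rendez-vous (Muller C-element): copy the inputs when they agree, else hold."""
    return a if a == b else prev

def settle(A, B, C):
    """Apply Eq. (10) to all four C-elements until none of them changes."""
    for _ in range(16):  # generous bound for this 4-element network
        new = {
            (i, j): c_element(A[i] ^ C[(i, 1 - j)], B[j] ^ C[(1 - i, j)], C[(i, j)])
            for i in (0, 1) for j in (0, 1)
        }
        if new == C:
            return C
        C = new
    raise RuntimeError("network did not settle")

def send(A, B, C, i, j):
    """Toggle wire A_i, then wire B_j, letting the network settle each time."""
    A, B = list(A), list(B)
    A[i] ^= 1
    C = settle(A, B, C)      # only one port has data: nothing may fire yet
    B[j] ^= 1
    return A, B, settle(A, B, C)

# All wires start at 0, as in step 1.
A, B = [0, 0], [0, 0]
C = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
for i, j in [(0, 1), (0, 0), (1, 1)]:
    A, B, C = send(A, B, C, i, j)
```

After each call to `send`, exactly one C-element, $C_{i,j}$, has changed value, and every C-element again sees two equal inputs.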

Though the expression would fit in a 6 ↦ 1 LUT, the feedback from one
LUT to the other would have to be routed through the general routing network,
which has the following drawbacks:

• it consumes routing resources,

• the timings of the feedbacks will differ between the feedback of a
LUT to itself (which is routed inside the PLB) and the others, routed outside.
This could be an attack point;

• the 2-input gate will always need 2 PLB: one for the 2x2-DW and one
for the computation itself.

Therefore, these feedbacks have been added to the PLB, as shown on Fig.
[12]. As on preceding figures, the black triangles at the inputs of the LUT
are MUX controlled by programming points, which are denoted by "[./.]" in Eq.
[11].

With these notations the equations of the 4 6 ↦ 1 LUT are:

$$
\begin{aligned}
L_0 &= \mathrm{LUT}([I'_0/L_0], [I'_1/L_1], [I'_2/L_2], [I'_3/L_3], I'_4, I'_5) \\
L_1 &= \mathrm{LUT}([I'_0/L_0], [I'_1/L_1], [I'_2/L_2], [I'_3/L_3], I'_4, I'_5) \\
L_2 &= \mathrm{LUT}([I''_0/L_0], [I''_1/L_1], [I''_2/L_2], [I''_3/L_3], I''_4, I''_5) \\
L_3 &= \mathrm{LUT}([I''_0/L_0], [I''_1/L_1], [I''_2/L_2], [I''_3/L_3], I''_4, I''_5)
\end{aligned}
\tag{11}
$$



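To implement the 2x2-DW the input lines are assigned as in Eq. [12] and
depicted by Fig. [13],

$$
\begin{array}{llllll}
I'_0 \leftarrow NC & I'_1 \leftarrow NC & I'_2 \leftarrow B_1 & I'_3 \leftarrow B_0 & I'_4 \leftarrow A_0 & I'_5 \leftarrow A_1 \\
I''_0 \leftarrow B_1 & I''_1 \leftarrow B_0 & I''_2 \leftarrow NC & I''_3 \leftarrow NC & I''_4 \leftarrow A_0 & I''_5 \leftarrow A_1
\end{array}
\tag{12}
$$

in which "NC" means "not connected", and Eq. [13] shows the interconnection
of the feedbacks needed to implement the 2x2-DW.

$$
\begin{aligned}
L_0 = C_{0,0} &= \mathrm{LUT}(C_{0,0}, C_{0,1}, C_{1,0}, B_0, A_0, A_1) \\
L_1 = C_{0,1} &= \mathrm{LUT}(C_{0,0}, C_{0,1}, B_1, C_{1,1}, A_0, A_1) \\
L_2 = C_{1,0} &= \mathrm{LUT}(C_{0,0}, B_0, C_{1,0}, C_{1,1}, A_0, A_1) \\
L_3 = C_{1,1} &= \mathrm{LUT}(B_1, C_{0,1}, C_{1,0}, C_{1,1}, A_0, A_1)
\end{aligned}
\tag{13}
$$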
Figure 12: PLB with all feedbacks for the 2-phase-edge protocol.

Figure 13: Wiring used to implement the 2 x 1-decision-wait.

Figure 14: 2 x 1-decision-wait.



Remark 9 $A_1$ is useless to compute $C_{0,0}$ and $C_{0,1}$ and $A_0$ is useless to compute
$C_{1,0}$ and $C_{1,1}$. The reason why these inputs are connected to the network but
ignored in the programming of the LUT is that $B_0$ and $B_1$ are connected twice
from the network to the PLB and that the loads on this network must be identical
for both variables.
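The remark can be made concrete by building the 64 configuration bits of the LUT that computes $C_{0,0}$. The sketch below is illustrative only: the function name `lut_c00`, the tuple-indexed `table` and the bit ordering (taken as the input order of Eq. (13)) are assumptions, and the C-element is modelled by the LUT holding its own fed-back output when its two rendez-vous inputs disagree.

```python
from itertools import product

def lut_c00(c00, c01, c10, b0, a0, a1):
    """Next value of C_{0,0}; a1 is wired to the LUT but deliberately ignored."""
    x, y = a0 ^ c01, b0 ^ c10        # the two rendez-vous inputs of Eq. (10)
    return x if x == y else c00      # hold the fed-back value when they differ

# One configuration bit per input combination (assumed ordering).
table = {bits: lut_c00(*bits) for bits in product((0, 1), repeat=6)}

# The programmed function must not depend on the last input (A_1).
depends_on_a1 = any(table[b[:5] + (0,)] != table[b[:5] + (1,)] for b in table)
```

Here `depends_on_a1` evaluates to `False`: the wire is present only to balance the load on the network, not to influence the output.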

Computation & synchronization The 2x2-DW stage provides a decoded
output: $C_{i,j}$ toggles if data i and j have arrived on inputs A and B
respectively.

Computing the outputs is then straightforward: each of the $O_1$ and $O_0$ outputs
is the XOR of the relevant $C_{i,j}$. Some examples:



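Gate    O_1                                        O_0

AND     $C_{1,1}$                                  $C_{0,0} \oplus C_{0,1} \oplus C_{1,0}$
NAND    $C_{0,0} \oplus C_{0,1} \oplus C_{1,0}$    $C_{1,1}$
OR      $C_{1,1} \oplus C_{0,1} \oplus C_{1,0}$    $C_{0,0}$
NOR     $C_{0,0}$                                  $C_{1,1} \oplus C_{0,1} \oplus C_{1,0}$
XOR     $C_{0,1} \oplus C_{1,0}$                   $C_{0,0} \oplus C_{1,1}$
NXOR    $C_{0,0} \oplus C_{1,1}$                   $C_{0,1} \oplus C_{1,0}$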
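The rows of this table follow mechanically from the truth table of each gate: $C_{i,j}$ contributes to $O_1$ when $f(i, j) = 1$ and to $O_0$ otherwise. A minimal sketch of that derivation follows; the names `output_terms` and `gates` are hypothetical.

```python
from itertools import product

def output_terms(f):
    """Return (O1 terms, O0 terms): the index pairs (i, j) of the C_{i,j}
    that are XORed together to drive each output wire of gate f."""
    pairs = list(product((0, 1), repeat=2))
    o1 = [p for p in pairs if f(*p) == 1]
    o0 = [p for p in pairs if f(*p) == 0]
    return o1, o0

gates = {
    "AND": lambda a, b: a & b,
    "NAND": lambda a, b: 1 - (a & b),
    "OR": lambda a, b: a | b,
    "NOR": lambda a, b: 1 - (a | b),
    "XOR": lambda a, b: a ^ b,
    "NXOR": lambda a, b: 1 - (a ^ b),
}

for name, f in gates.items():
    o1, o0 = output_terms(f)
    print(name, "O1:", o1, "O0:", o0)
```

Every $C_{i,j}$ appears in exactly one of the two lists, so each incoming pair of data toggles exactly one output wire.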



Synchronization The synchronization is performed by a device called "2x1-
decision-wait" (or, shorter: 2x1-DW). Fig. [14] depicts the schematic of the
2x1-DW.

The 2x1-DW works as follows:

1. In the initial state, the following relations hold: $O_1 = I_1$, $O_0 = I_0$ and
$S_{in} = \overline{O_0 \oplus O_1}$, which imply $J_1 = \overline{I_1}$ and $J_0 = \overline{I_0}$ (because $J_1 =
S_{in} \oplus O_0 = \overline{O_0 \oplus O_1} \oplus O_0 = \overline{O_1} = \overline{I_1}$, idem for $J_0$);

2. Assume $I_1$ toggles and thus becomes equal to $J_1$: the C-element transmits
the common value of its inputs to $O_1$,

3. as $O_i$ toggles, $J_{1-i}$ toggles too and becomes equal to $O_{1-i}$,

4. until $S_{in}$ toggles, we have $I_0 = J_0$ and $I_1 = J_1$: even if one of the inputs
toggles, the C-elements will remain stable;

5. when $S_{in}$ toggles, $J_0$ and $J_1$ toggle and the system is back in the initial
state.

Figure 15: Wiring used to implement the 2 x 1-decision-wait.
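The locking behaviour of the 2x1-DW can be sketched with a small simulation. Several points are assumptions of this sketch rather than statements about the actual circuit: each output $O_k$ is modelled as a C-element of $I_k$ and $J_k$, with $J_1 = S_{in} \oplus O_0$ and $J_0 = S_{in} \oplus O_1$, the idle state is the one in which $J_k$ differs from $I_k$, and delays are abstracted into update rounds. The function names are hypothetical.

```python
def c_element(a, b, prev):
    """Rendez-vous (Muller C-element): copy the inputs when they agree, else hold."""
    return a if a == b else prev

def settle(I, O, s_in):
    """Update both C-elements until the outputs stop changing; returns new O."""
    O = list(O)
    for _ in range(8):  # generous bound for this 2-element network
        J = [s_in ^ O[1], s_in ^ O[0]]   # J_0 watches O_1, J_1 watches O_0
        new = [c_element(I[k], J[k], O[k]) for k in (0, 1)]
        if new == O:
            return O
        O = new
    raise RuntimeError("network did not settle")

# Idle state under the assumed polarity: J_k differs from I_k on both wires.
O = settle([0, 0], [0, 0], 1)   # nothing happens
O = settle([0, 1], O, 1)        # I_1 toggles: O_1 follows
O = settle([1, 1], O, 1)        # I_0 toggles too early: O_0 stays locked
O = settle([1, 1], O, 0)        # S_in toggles: the pending edge on I_0 passes
```

The third call shows step 4: a new input edge arriving before $S_{in}$ has toggled is held back until the acknowledge arrives.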

If one wants to combine the computation stage with the 2x1-DW, it cannot
be done in 2 LUT.

This is not because of the complexity of the functions: each of $O_1$ and $O_0$ is a
function of 2 feed-backs, 1 $S_{in}$ and at most 3 $C_{i,j}$, at least if one does not want
to implement trivial, constant functions.

However, the set of 2 LUT together would need 2 feed-backs, 1 $S_{in}$ and 4
$C_{i,j}$, which is one more than the number of available wires. Thus we must use
a full PLB.

If we use the full PLB, the memory element will provide the necessary
C-element and the LUT become purely combinatorial. The inputs will be assigned
following Eq. [14] and depicted on Fig. [15].

= (Co,o,Co,i,Ci,o,Ci,i) and I'l = S,n (14) 

Then the LUT are programmed as by Eq. [15]:

$$
\begin{aligned}
L_0 &= f^1(C_{0,0}, C_{0,1}, C_{1,0}, C_{1,1}) \\
L_1 &= f^0(C_{0,0}, C_{0,1}, C_{1,0}, C_{1,1}) \\
L_2 &= O_0 \oplus S_{in}
\end{aligned}
\tag{15}
$$

Figure 16: FIFO memory for programming.



4.3.2 Conclusion on the 2-Phase, Edge Protocol 

The 2-phase-edge protocol is difficult to implement in an FPGA without special
hardware added to the PLB: it takes two PLB to implement a single 2-input
gate.

However this protocol has security advantages because the instantaneous
value of the wires is not significant in itself. For instance '1' is represented
alternately by the rising and the falling edge of a given wire. An attacker
trying DPA, for instance, would have to exhibit the difference between the average
consumption of both edges on wire '1' and the same average on wire '0'.

5 Programming the FPGA 

The FPGA can be partially programmed: it is divided into square blocks, each of
which can be programmed separately from the others.

The programming chain is a set of asynchronous FIFO memories. An
elementary stage of these FIFO is depicted by Fig. [16].

At RESET time, all C-elements are set to zero by a general RESET wire. 
Then the programming bits are fed to the FIFO, separated by H values. The 
last stage of each FIFO is particular: the Acknowledge signal is controlled by 
an external pin. During the programming of the block, the Acknowledge signal 
is held low. This way the programming bits are stacked in the FIFO and the 
FPGA becomes functional. 

If a partial reconfiguration is wanted, the chosen blocks are cleared by allow- 
ing the Acknowledge signal of their last stage to acknowledge the value in the 
last stage. Then the FIFO is activated again until all bits have gone through it.
At this point, the Acknowledge signal is blocked again and the FIFO is ready 
to receive a new set of configuration bits. 

During the configuration of the FPGA, all outputs of PLB are kept at to 
avoid short-circuits. The PLB are programmed first, while all switchboxes are 
left in an insulation mode. Then the switchboxes are programmed to connect
the newly reconfigured part to the still-working part. It is the
designer's responsibility to ensure that the new part can create no conflict with 
the existing part. 
6 Conclusion



We have presented the programmable logic block of an asynchronous FPGA,
which is oriented towards security rather than performance. In particular we
have chosen not to implement early evaluation, one of the advantages of an
asynchronous design, which usually allows computing in average time. This
choice is deliberate as early evaluation is a security risk [i^ . 

The FPGA can accommodate various sizes of data as well as various styles of 
asynchronous control, thus making it possible for the end user to design mixed 
styles of logic, depending on the application requirements. Incidentally, this
FPGA is also a valuable prototype that allows comparisons between
styles of asynchronous protocols.

A silicon implementation is being manufactured and will be used for intensive
testing. The different resistances of the various protocols against SCA will be
evaluated. In particular the strict link under the 2-phase-LEDR protocol between
the value of a signal X and that of the $x_d$ wire will decide whether this protocol
is suitable at all for a secure implementation.